ABSTRACT

The paper presents analyses of score reliability for math placement
tests developed for use with undergraduate students. Subjects were
all 589 students seeking admission to the math curricula during the
course of the study. The paper also reviews some common
misconceptions about reliability, e.g., the misconception that
tests themselves are reliable, and the misconception that longer
tests always yield more reliable data than their shorter
counterparts. It is found that the test and its subscales yield
data with reasonably sound measurement integrity.



Counselors and faculty in mathematics departments have found 
the placement tests developed by the Mathematical Association of
America (MAA) (1984) to be useful in providing optimal instruction
for students. These tests were developed for college students and 
can be used to place students in courses ranging from remedial 
programs through calculus. However, as noted in the user's guide, 
the measurement integrity of test results must be evaluated in each 
test application. 

Too few researchers and educators realize that tests are not 
reliable or unreliable; rather, data have these characteristics, 
albeit data generated on a given measure administered with a given 
protocol to given subjects on given occasions. As Rowley (1976, p.
53) notes, "It needs to be established that an instrument itself
is neither reliable nor unreliable." As Sax (1980, p. 261)
explains:

It is more accurate to talk about the reliability of
measurements (data, scores, and observations) than
the reliability of tests (questions, items, and
other tasks). Tests cannot be stable or unstable,
but observations can. Any reference to the
"reliability of a test" should always be interpreted
to mean the "reliability of measurements or
observations [i.e., a particular set of data]
derived from a test."

One important implication of the realization that reliability
inures to data (rather than tests) is that reliability should
generally be explored whenever data are collected.

Too few researchers realize that reliability is critical to
detecting substantive effects or to making placement decisions.
As Locke, Spirduso and Silverman (1987, p. 28) note, "the
correlation between scores from two tests cannot exceed the square 
root of the product for reliability in each test." Thus, if a 
researcher is correlating scores having a reliability of .9 with 
scores having a reliability of .6, the correlation cannot exceed 
.73. Prospectively, researchers must select measures that will 
allow detection of effects at the level desired; retrospectively, 
researchers must take reliability into account when interpreting 
findings. 
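
The ceiling described in the preceding paragraph can be checked numerically. The following is a minimal Python sketch, not part of the original analyses; the function name `max_observed_correlation` is a hypothetical helper introduced here only for illustration.

```python
from math import sqrt


def max_observed_correlation(rel_x, rel_y):
    """Upper bound on the correlation between two sets of scores,
    given the reliability of each set (hypothetical helper)."""
    if not (0.0 <= rel_x <= 1.0 and 0.0 <= rel_y <= 1.0):
        raise ValueError("reliabilities must lie between 0 and 1")
    # The bound is the square root of the product of the reliabilities.
    return sqrt(rel_x * rel_y)


# Reliabilities of .9 and .6, as in the example in the text.
print(round(max_observed_correlation(0.9, 0.6), 2))  # 0.73
```

The bound falls as either reliability falls, which is why less reliable scores limit the size of the effects that can be detected.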

The purpose of the present paper is to report two types of 
measurement studies conducted with one set of the MAA placement
measures, i.e., a revision of the set of items employed in a previous
report (Melancon & Thompson, 1989). First, various reliability
analyses in which classical alpha coefficients were computed are
reported; classical test theory is not without its limits (Algina,
1990; Eason, in press; Thompson, 1989; Webb, Rowley & Shavelson,
1988), but internal consistency reliability estimates can be useful
in applied studies, such as the present study. Second, various item 
analysis statistics (Thompson & Levitov, 1985) are also reported, 
since these also bear upon reliability. 

Subjects 

Subjects in a previous study (Melancon & Thompson, 1989) of
MAA placement test items were 539 undergraduate students at a
private university in the south. The previous version of the 50-
item test included three subscales, respectively measuring
arithmetic (18 items), algebra (20 items), and finite math (12
items).

Subjects in the present study were 589 undergraduate students 
at the same university. The revised version of the 50-item test 
included two subscales of 25 items each, an arithmetic scale and 
an algebra scale. Subjects were all the students who were seeking 
admission into the undergraduate mathematics curriculum. 

Coefficient Alpha Analyses

Alpha coefficients for the arithmetic (v=18), algebra (v=20)
and finite math (v=12) subtest scores in the previous study
(Melancon & Thompson, 1989) were .60, .82, and .47, respectively. 
Coefficient alpha for the total score based on 50 items was .83. 
Based on these analyses the finite math subscale was dropped, and 
additional items were added to the first two scales. 

Application of the Spearman-Brown prophecy formula suggests
that lengthening the arithmetic subscale by a factor of k=1.388888
(25 items / 18 items) would yield a new reliability of:



\[
r^{*} = \frac{k\,r}{1 + (k - 1)\,r}
      = \frac{1.389 \times 0.6}{1 + (1.389 - 1) \times 0.6}
      = \frac{0.833}{1 + 0.233}
      = \frac{0.833}{1.233}
      \approx 0.676,
\]



if the added items were of exactly the same quality as the previous 
items. 

Similarly, lengthening the algebra subscale by a factor of k=1.25
(25 items / 20 items) was expected to yield a new reliability of:



\[
r^{*} = \frac{k\,r}{1 + (k - 1)\,r}
      = \frac{1.25 \times 0.82}{1 + (1.25 - 1) \times 0.82}
      = \frac{1.025}{1 + 0.205}
      = \frac{1.025}{1.205}
      \approx 0.851.
\]
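
The two Spearman-Brown projections above can be reproduced with a few lines of code. This is a minimal Python sketch under the assumption that the added items are of the same quality as the original items; the function name `spearman_brown` is a label chosen here and does not come from the original report.

```python
def spearman_brown(r, k):
    """Projected reliability when a test is lengthened by a factor k,
    starting from an observed reliability r (illustrative helper)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return (k * r) / (1.0 + (k - 1.0) * r)


# Arithmetic subscale: 18 items lengthened to 25, prior alpha of .60.
arithmetic = spearman_brown(0.60, 25 / 18)
# Algebra subscale: 20 items lengthened to 25, prior alpha of .82.
algebra = spearman_brown(0.82, 25 / 20)

print(round(arithmetic, 2), round(algebra, 2))  # 0.68 0.85
```

With k = 1 the function returns the original reliability unchanged, which is a convenient sanity check.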



The revised subscales actually had alpha coefficients of .75
and .86, respectively. Alpha for the total scale was .89. The
standard deviations for the subscales were 3.9 and 5.7,
respectively. The smaller score variability on the arithmetic
scale, rather than poor item quality, appeared to be the basis for
the lower reliability estimate obtained for the arithmetic scale
data (.75). The students did better on arithmetic items, as might
be expected, and were relatively homogeneous with respect to their
performance in this skill area.

Variance of total scores is an important influence on
calculated alpha, as can be seen from examination of the formula
for computing alpha for dichotomously scored items:

\[
\alpha = \frac{v}{v - 1}\left(1 - \frac{\sum p q}{SD^{2}}\right),
\]

where v is the number of items, $\sum pq$ is the sum of the item
variances, and SD is the standard deviation of the total test or
scale scores. The variance of a dichotomously scored item equals
the proportion of people who get the item right (p) times the
proportion who get the item wrong (q = 1-p). Even when item
variance is at its maximum (p = q = .5), item variance tends to be
a small value (p*q = .5*.5 = .25) that has a limited influence on
alpha. Thus, alpha is heavily influenced by the variance of the
total scores on a scale or a test, and is larger as the score
variance gets larger.
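
The alpha formula for dichotomously scored items can be sketched in code as follows. This is a minimal Python illustration on a made-up score matrix of four examinees and three items, not the study data; the function name `kr20_alpha` and the use of the population variance (dividing by n, to match item variances computed as p times q) are assumptions made here.

```python
def kr20_alpha(scores):
    """Coefficient alpha for 0/1 item scores.

    scores: list of rows, one row of 0/1 item scores per examinee.
    Assumption: total-score variance is the population variance
    (divide by n), consistent with item variances computed as p * q.
    """
    n = len(scores)
    v = len(scores[0])
    # Item difficulty p: proportion of examinees answering correctly.
    p = [sum(row[j] for row in scores) / n for j in range(v)]
    sum_item_var = sum(pj * (1.0 - pj) for pj in p)
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    total_var = sum((t - mean_total) ** 2 for t in totals) / n
    return (v / (v - 1)) * (1.0 - sum_item_var / total_var)


# Made-up data for illustration only.
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20_alpha(toy_scores))  # 0.75
```

In the toy data the item variances sum to .625 and the total-score variance is 1.25, so alpha is 1.5 * (1 - .5) = .75.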



It is a misconception that longer tests inherently yield more
reliable data than shorter tests (Melancon & Thompson, 1990;
Thompson, 1990). It is total variance (SD**2) that largely drives
reliability estimates. Total variance equals the sum of the item
variances plus 2.0 times the sum of (each unique interitem
correlation coefficient times the product of the standard
deviations of the two items in each unique pairwise combination of
items). Thus, when new items are added to a test, if they are
negatively correlated with the previous items, the total test
variance can actually go down!
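
The decomposition of total variance described above can be illustrated with a small made-up example. The Python sketch below is an illustration only, not the study data; the function names are chosen here, and each covariance stands in for "correlation times the product of the two standard deviations," which is the same quantity.

```python
def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance; equals r * SD_x * SD_y."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def total_variance(items):
    """Total-score variance built from item variances and covariances.

    items: list of columns, one list of 0/1 scores per item.
    """
    v = len(items)
    item_var_sum = sum(covariance(col, col) for col in items)
    cov_sum = sum(
        covariance(items[i], items[j])
        for i in range(v)
        for j in range(i + 1, v)
    )
    return item_var_sum + 2.0 * cov_sum


# Three positively related items (made-up data, four examinees).
items = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
# A fourth item that is negatively related to the first three.
negative_item = [0, 0, 1, 1]

before = total_variance(items)
after = total_variance(items + [negative_item])
print(before, after)  # 1.25 0.5
```

Here the added item has a variance of .25 of its own, but its negative covariances with the other items subtract 1.0, so the longer test has a smaller total variance than the shorter one.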

Item Analyses 

Table 1 presents item difficulty (p) and SD ((p * q)**.5)
statistics for the present study. The table also presents scale
(v=25) alpha-if-deleted statistics for each item, and the corrected
correlation between scores on each item ("0" or "1") and total
scores on each scale (conceivably ranging "0" to "24") when the
given item is omitted from the scale score (rIxS). Finally, the
table presents test (v=50) alpha-if-deleted statistics for each
item, and the corrected correlation between scores on each item
("0" or "1") and total scores on the test (conceivably ranging "0"
to "49") when the given item is omitted from the test score
(rIxT).
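
The item statistics described above can be computed as in the following minimal Python sketch. It runs on a made-up matrix of four examinees and three items rather than the study data, and the function names (`pearson`, `kr20_alpha`, `item_statistics`) and the use of population variances are assumptions introduced here for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def kr20_alpha(scores):
    """Coefficient alpha for 0/1 items (population variances assumed)."""
    n, v = len(scores), len(scores[0])
    p = [sum(row[j] for row in scores) / n for j in range(v)]
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    total_var = sum((t - mean_total) ** 2 for t in totals) / n
    return (v / (v - 1)) * (1.0 - sum(pj * (1 - pj) for pj in p) / total_var)


def item_statistics(scores):
    """Per-item p, SD, alpha-if-deleted, and corrected item-total r."""
    n, v = len(scores), len(scores[0])
    results = []
    for j in range(v):
        item = [row[j] for row in scores]
        # Drop item j so the total is "corrected" for the item itself.
        rest = [row[:j] + row[j + 1:] for row in scores]
        rest_totals = [sum(row) for row in rest]
        p = sum(item) / n
        results.append({
            "p": p,
            "sd": sqrt(p * (1.0 - p)),
            "alpha_if_deleted": kr20_alpha(rest),
            "r_corrected": pearson(item, rest_totals),
        })
    return results


# Made-up data for illustration only.
toy_scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
for number, stats in enumerate(item_statistics(toy_scores), start=1):
    print(number, {key: round(value, 2) for key, value in stats.items()})
```

An item whose alpha-if-deleted value exceeds the alpha for the full scale is a candidate for removal, and a positive corrected correlation indicates that the item varies in the same direction as the rest of the scale.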

INSERT TABLE 1 ABOUT HERE . 



Discussion 

The revisions of the MAA tests yielded slightly better than
expected alpha coefficients for both the arithmetic (.75) and the
algebra (.86) scales. The somewhat lower reliability of the
arithmetic scale appears to be related to a higher mean on the
scale and more homogeneous performance (SD=3.9), an occurrence that
might be expected in a college setting. As indicated by the Table
1 results, the 50 items were all positively correlated with scale
and test scores; thus each item adds to total score variance and
makes a positive contribution to score reliability.

Overall, the results suggest that the revised test can yield 
data that are reasonably reliable. No items detracting from test 
performance were noted. It is therefore suggested that the revised 
test may be useful in placing students into an appropriate 
curriculum, though the measurement integrity of data from the test 
should be continuously reviewed as part of future applications of 
the measure. 



References

Algina, J. (1990). Elements of classical reliability theory and
generalizability theory. In B. Thompson (Ed.), Advances in
social science methodology (Vol. 1, pp. 137-169). Greenwich,
CT: JAI Press.

Eason, S. (in press). Why generalizability theory yields better
results than classical test theory: A primer with concrete
examples. In B. Thompson (Ed.), Advances in educational
research: Substantive findings, methodological developments
(Vol. 1). Greenwich, CT: JAI Press.

Locke, L.F., Spirduso, W.W., & Silverman, S.J. (1987). Proposals
that work: A guide for planning dissertations and grant
proposals (2nd ed.). Newbury Park, CA: SAGE.

Mathematical Association of America. (1984). User's guide: The
placement test program of the Mathematical Association of
America (3rd ed.). Washington, DC: Author.

Melancon, J., & Thompson, B. (1989, January). Local norms and test
characteristics for selected forms of the MAA Math Placement
Test. Paper presented at the annual meeting of the Southwest
Educational Research Association, Houston. (ERIC Document
Reproduction Service No. ED 303 488)

Melancon, J., & Thompson, B. (1990). Maximizing test reliability
by stepwise variable deletion: A case study with the Finding
Embedded Figures Test. Perceptual and Motor Skills, 70, 99-110.

Rowley, G.L. (1976). The reliability of observational measures.
American Educational Research Journal, 13, 51-59.

Sax, G. (1980). Principles of educational and psychological
measurement and evaluation (2nd ed.). Belmont, CA: Wadsworth.

Thompson, B. (1989, January). Why generalizability coefficients
are an essential aspect of reliability assessment. Paper
presented at the annual meeting of the Southwest Educational
Research Association, Houston.

Thompson, B. (1990). ALPHAMAX: A program that maximizes coefficient
alpha by selective item deletion. Educational and Psychological
Measurement, 50, 585-589.

Thompson, B., & Levitov, J.E. (1985). Using microcomputers to score
and evaluate test items. Collegiate Microcomputer, 3, 163-168.

Webb, N.M., Rowley, G.L., & Shavelson, R.J. (1988). Using
generalizability theory in counseling and development.
Measurement and Evaluation in Counseling and Development, 21,
81-90.






Table 1

Item Analysis for the 50-Item Test

                     Scale               Test
                    alpha if            alpha if
Item    p     SD    deleted    rIxS     deleted    rIxT

Arithmetic Scale
  1    .90   .30     .75       .14       .89       .09
  2    .94   .24     .75       .22       .89       .24
  3    .96   .19     .76       .02       .89       .03
  4    .82   .39     .74       .33       .89       .39
  5    .77   .42     .73       .44       .89       .44
  6    .91   .29     .75       .15       .89       .18
  7    .77   .42     .75       .28       .89       .29
  8    .62   .49     .73       .42       .89       .43
  9    .86   .35     .74       .30       .89       .30
 10    .90   .31     .75       .28       .89       .33
 11    .65   .48     .74       .29       .89       .35
 12    .96   .19     .75       .12       .89       .15
 13    .87   .34     .74       .33       .89       .40
 14    .57   .50     .76       .15       .89       .12
 15    .55   .50     .75       .25       .89       .25
 16    .63   .48     .73       .52       .89       .48
 17    .50   .50     .74       .40       .89       .39
 18    .34   .47     .75       .28       .89       .29
 19    .66   .47     .75       .27       .89       .22
 20    .83   .38     .74       .33       .89       .31
 21    .88   .33     .74       .34       .89       .40
 22    .71   .46     .74       .40       .89       .40
 23    .45   .50     .75       .25       .89       .23
 24    .77   .42     .74       .32       .89       .28
 25    .84   .37     .75       .15       .89       .18

Algebra Scale
 26    .84   .36     .86       .41       .89       .43
 27    .51   .50     .86       .50       .89       .49
 28    .72   .45     .86       .53       .89       .52
 29    .76   .43     .86       .53       .89       .54
 30    .92   .28     .86       .33       .89       .36
 31    .75   .43     .86       .50       .89       .52
 32    --    .47     .86       .48       .89       .49
 33    .88   .33     .86       .37       .89       .38
 34    .81   .39     .86       .27       .89       .31
 35    .60   .49     .86       .53       .89       .48
 36    .64   .48     .86       .33       .89       .33
 37    .55   .50     .85       .60       .89       .58
 38    .75   .43     .86       .33       .89       .32
 39    .48   .50     .86       .42       .89       .40
 40    .40   .49     .86       .50       .89       .47
 41    .65   .48     .86       .49       .89       .51
 42    .76   .43     .86       .44       .89       .42
 43    .27   .44     .86       .50       .89       .47
 44    .56   .50     .86       .53       .89       .49
 45    .49   .50     .86       .36       .89       .33
 46    .50   .50     .86       .32       .89       .30
 47    .44   .50     .86       .36       .89       .35
 48    .50   .50     .87       .20       .89       .21
 49    .54   .50     .86       .25       .89       .28
 50    .47   .50     .86       .42       .89       .45

Note. -- = value illegible in the source.